Chapter 2 — From Transcription to Digital Text

We are faced with a problem. There is now a machine-readable text, but it is littered with material we do not want. To create their edition, Bouwman et al. added line numbers, an English translation, and a plethora of footnotes and annotations. There are also remnants of the print book: page numbers, page headers, etc. If a full computational edition of the Reynaert edition by Bouwman et al. were the goal of this project, we might actually be interested in these elements and want to capture them. Here, however, I am interested in the meticulous reproducibility of the scholarly process of editing through code, and in reading the text of the Reynaert through code. Capturing the book metaphor that is built into the existing edition is less relevant to this direct aim, though it would be an interesting later project to pursue the mise en abyme of computationally creating the digital scholarly edition of the scholarly print edition. Right now the task is to separate the Middle Dutch verses from the rest.

Of a parser and models

The approach I take is rather straightforward. I will read in the text line by line and check whether each line matches a certain model. So we will need models, and some machinery to match the text against those models. That piece of machinery we usually call a parser. The models we will just call models.

The super model

So, we know that we will need several models: pieces of code that can recognize footnotes, empty lines, page numbers, etc. for us. This is a case where we have several objects of the category model, and each model must be able to match itself against each line of a text. We express this commonality via a superclass Model. All concrete models for matching will be variants of it.


In [1]:
class Model

  # Determines if the model matches a line of text.
  # By default it returns false because it doesn't match anything.
  def matches( line )
    false
  end

end


Out[1]:
:matches

But wait... not all models will apply to exactly one line. Remember those footnotes? Those are multiline phenomena. We will need some way of registering or knowing that a model is terminated. Thus we add a variable, and a way to read it, to the super model. Each derived concrete model will then be able to 'know' by which other models it is terminated.


In [2]:
class Model

  # A class instance variable that holds a list of
  # other models that terminate this model.
  @terminators = nil
  def self.terminators
    @terminators
  end

  # Determines if the model matches a line of text.
  # By default it returns false because it doesn't match anything.
  def matches( line )
    false
  end

end


Out[2]:
:matches
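
How a concrete model would declare its terminators is not shown above, so here is a minimal sketch of one way to do it. The class MultilineFootNote and its choice of Empty as terminator are hypothetical, invented for illustration only. The sketch also shows a subtlety of Ruby: a class instance variable such as @terminators is not inherited, so every subclass that does not set it simply reports nil.

```ruby
# Super model, repeated here so the snippet runs on its own.
class Model
  @terminators = nil
  def self.terminators
    @terminators
  end

  def matches( line )
    false
  end
end

# A single-line model: it declares no terminators.
class Empty < Model
  def matches( line )
    line.match( /^\s*$/ ) != nil
  end
end

# Hypothetical multiline model (an assumption, not part of the chapter's code):
# a footnote that stays 'open' until an empty line is encountered.
class MultilineFootNote < Model
  @terminators = [ Empty ]

  def matches( line )
    line.match( /^\d+(-| )(.+)$/ ) != nil
  end
end

puts Model.terminators.inspect              # nil
puts Empty.terminators.inspect              # nil (class instance variables are not inherited)
puts MultilineFootNote.terminators.inspect  # [Empty]
```

A model with nil terminators is thus treated as a plain single-line model; only models that explicitly set the variable can span several lines.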

Concrete models

Now we need several concrete models that will enable us to categorize lines in the text. Looking at the text we see that there are a number of 'types' of lines that we don't need: lines that contain only numbers (page numbers or verse numbers), lines that are set in all capitals and that coincide with page headers and chapter headings, lines belonging to footnotes, and lastly empty lines. We can express this by creating concrete model classes that implement the matches method of the superclass in specific ways. Thus we end up with four models (AllCaps, FootNote, Numbers, and Empty) that each use a different regular expression to match the text surfaces that are typical for each type of line. You'll find these regular expressions as the parts between slashes in each class below (e.g. /[[:upper:]]/, which matches upper case letters). If you have not encountered such expressions before they may seem hermetic, but with a bit of study they become sufficiently understandable.


In [3]:
# Matches a line that contains at least one capital and no lower case letters.
class AllCaps < Model

  def matches( line )
    !!line.match( /[[:upper:]]/ ) && !line.match( /[[:lower:]]/ )
  end

end

# Matches a line starting with at least one digit, followed by a dash or a space.
class FootNote < Model

  def matches( line )
    line.match( /^\d+(-| )(.+)$/ ) != nil
  end

end

# Matches a line containing only numbers. 'o' (lower case letter o) is also
# accepted as the OCR frequently misreads 0 as o.
class Numbers < Model

  def matches( line )
    line.match( /^[\do]+$/ ) != nil
  end

end

# Matches an empty line.
class Empty < Model

  def matches( line )
    line.match( /^\s*$/ ) != nil
  end

end


Out[3]:
:matches
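
To get a feeling for what these four models do and do not catch, the following sketch runs them over a handful of sample lines. The classes are repeated so the snippet runs on its own; the helper matching_models and the sample lines (apart from the Reynaert verse and the heading taken from the edition) are made up for illustration.

```ruby
class Model
  def matches( line )
    false
  end
end

class AllCaps < Model
  def matches( line )
    !!line.match( /[[:upper:]]/ ) && !line.match( /[[:lower:]]/ )
  end
end

class FootNote < Model
  def matches( line )
    line.match( /^\d+(-| )(.+)$/ ) != nil
  end
end

class Numbers < Model
  def matches( line )
    line.match( /^[\do]+$/ ) != nil
  end
end

class Empty < Model
  def matches( line )
    line.match( /^\s*$/ ) != nil
  end
end

# Hypothetical helper: the names of all models that match a line.
def matching_models( models, line )
  models.select { |model| model.matches( line ) }.map { |model| model.class.name }
end

models = [ AllCaps.new, FootNote.new, Numbers.new, Empty.new ]

samples = [
  "PROLOGUE",                       # all capitals
  "12 Some footnote text",          # digits followed by a space
  "1o5",                            # a number with an OCR 'o' for '0'
  "   ",                            # whitespace only
  "King Nobel holds court",         # mixed case, so AllCaps does not fire
  "dat die avonture van Reynaerde"  # a Middle Dutch verse
]

samples.each do |line|
  puts "#{line.inspect} => #{matching_models( models, line ).inspect}"
end
```

The last two samples come back with an empty list: none of the surface models recognizes them, so at this stage they would both pass for verse.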

Differentiating Middle Dutch from English

All the line types we have seen so far have some recognizable surface features (they are empty, contain numbers, and so forth). When it comes to telling apart "dat die avonture van Reynaerde" from "that the tales of Reynaert", there are no visual clues at the surface of the text to go on. We need some more knowledge to identify the former as Middle Dutch and the latter as English. The 'English model' is therefore somewhat more complicated than the other classes. It does not need to get as complicated as using sophisticated natural language processing (NLP) software packages. An admittedly naive but straightforward approach is to use a list of English stop words. If more than 20% of the words on a line are such stop words (put differently: if the line passes a threshold of 0.2), we can safely assume that the line is in English. There are some subtleties worth noting; to point these out, commentary is provided within the code of the class below.


In [4]:
class English < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  def initialize
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" )
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end

end


Out[4]:
:matches
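
As a small worked example of the scoring, the sketch below applies the same logic to the two lines quoted earlier. It is a standalone function rather than the class itself, so that it runs without the stop word file. The STOPWORDS constant is a tiny made-up sample, not the contents of stopwords_en.txt, and the guard against lines without tokens is my own addition.

```ruby
# Assumption: a handful of English stop words for illustration only.
STOPWORDS = %w[ that the of and a to ]

# Same punctuation stripping as in the English class
# (curly quotes written as \u escapes to keep the source ASCII).
def strip_embracing_punctuation( token )
  token.gsub( /[\.:;'\u201C\u2018\u2019\u201D?!\(\),]+$|^[\.:;'\u201C\u2018\u2019\u201D?!\(\),]+/, '' )
end

# Relative English score: stop word tokens divided by all tokens.
def english_score( line, stopwords )
  tokens = line.split( /\s+/ )
  return 0.0 if tokens.empty?   # added guard, not in the original class
  hits = tokens.count do |token|
    stopwords.include?( strip_embracing_punctuation( token ).downcase )
  end
  hits.to_f / tokens.size
end

puts strip_embracing_punctuation( "been." )                         # been
puts english_score( "dat die avonture van Reynaerde", STOPWORDS )   # 0.0
puts english_score( "that the tales of Reynaert", STOPWORDS )       # 0.6
```

With three stop words among five tokens, the English line scores 0.6 and clears the 0.2 threshold comfortably, while the Middle Dutch verse scores 0.0.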

Adding an engine

Now that we have all these models, we need something that will take an actual text, set the models loose on it, and return just the Middle Dutch verses that we were looking for. This piece of machinery we will call the OCRParser. The OCRParser class takes a text (method load_text) and splits it on line breaks (method text=). Then it delegates the matching of lines to the models described above in the method match_lines.

The method parse_tuples takes into account that a model may span multiple lines. It keeps track of which multiline models are active and checks whether a newly matched model terminates any of the active multiline models. It then yields every line together with a flag that is true when the line is accepted (no model matched it and no multiline model is still active) and false otherwise. It also adds a list of the models that matched the line.

The parse method filters that result and returns only those lines that did not answer to any model, which should be only the Middle Dutch verses.


In [5]:
class OCRParser

  attr_accessor :models

  def text=( text )
    @text = text
    @lines = text.split( "\n" )
  end

  def load_text( file_path )
    self.text = File.read( file_path )
  end

  def match_lines
    @lines.each do |line|
      matches = []
      @models.each do |model|
        if model.matches( line )
          matches.push( model.class )
        end
      end
      yield line, matches
    end
  end

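  # Yields every line with an accept flag and the classes of the models that
  # matched it. A line is accepted only if no model matched and no multiline
  # model is still active.
  def parse_tuples
    active_multiline_models = []
    match_lines do |line, matches|
      matches.each do |model|
        # A newly matched model closes every active model it terminates.
        active_multiline_models.reject! do |active_multiline_model|
          active_multiline_model.terminators.include? model
        end
        # Models that declare terminators span multiple lines and stay active.
        if model.terminators != nil
          active_multiline_models.push( model )
        end
      end
      if matches.size == 0 && active_multiline_models.size == 0
        yield true, line, matches
      else
        yield false, line, matches
      end
    end
  end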

  def parse
    tuples = []
    parse_tuples { | accept, line | tuples.push line if accept }
    tuples
  end

end


Out[5]:
:parse
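
The bookkeeping for multiline models is the least obvious part of the parser, so here is a condensed sketch of just that logic, run on a few lines held in memory. The function accepted_lines is a compact stand-in for parse_tuples and parse, not the class itself. MultilineFootNote, with Empty as its terminator, is a hypothetical model, and the two footnote lines are invented sample text.

```ruby
class Model
  def self.terminators
    @terminators
  end

  def matches( line )
    false
  end
end

class Empty < Model
  def matches( line )
    line.match( /^\s*$/ ) != nil
  end
end

# Hypothetical multiline model: stays active until an Empty line matches.
class MultilineFootNote < Model
  @terminators = [ Empty ]

  def matches( line )
    line.match( /^\d+(-| )(.+)$/ ) != nil
  end
end

# Condensed version of the parse_tuples/parse logic.
def accepted_lines( lines, models )
  active = []     # multiline models that are currently 'open'
  accepted = []
  lines.each do |line|
    matches = models.select { |model| model.matches( line ) }.map( &:class )
    matches.each do |model|
      # A matched model closes every open model that lists it as terminator.
      active.reject! { |open_model| open_model.terminators.include?( model ) }
      # A model with terminators opens a multiline span.
      active.push( model ) unless model.terminators.nil?
    end
    accepted.push( line ) if matches.empty? && active.empty?
  end
  accepted
end

lines = [
  "daer hi dicken omme waecte,",
  "12 A footnote that starts here",
  "and runs on to a second line.",
  "",
  "dat hi die vijte dede soucken"
]

puts accepted_lines( lines, [ Empty.new, MultilineFootNote.new ] )
```

Only the two verses are printed. The second footnote line matches no model at all, but it is still rejected because the footnote model remains active until the empty line terminates it.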

Kicking it into life

We now have all the different parts that yield only the Middle Dutch text. All that is left to do is to instantiate a new OCRParser, feed it a text and the models, and then request the result. That's what the next little snippet finally does.


In [6]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, English.new ]
parsed = text.parse()
puts parsed.join( "\n" )


Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
dat die avonture van Reynaerde
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
Prologue
among people.
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die dorpren ende die doren,
ic wille dat dieghene horen
die gherne pleghen der eeren
ende haren zin daertoe keeren
dat si leven hoofschelike,
sijn si arem, sijn si rike,
diet verstaen met goeden sinne.
Nu hoert hoe ic hier beghinne!
dat beede bosch ende haghe
met groenen loveren waren bevaen.
Nobel die coninc hadde ghedaen
sijn hof crayeren overal,
dat hi waende, hadde hijs gheval,
houden ten wel groeten love.
Doe quamen tes sconinx hove
alle die diere, groet ende cleene,
sonder vos Reynaert alleene.
Hi hadde te hove so vele mesdaen
dat hire niet dorste gaen.
Die hem besculdich kent, ontsiet.
ende hieromme scuwedi sconinx hof,
daer hi in hadde crancken lof.
Doe al dat hofversamet was,
was daer niemen, sonder die das,
hi ne hadde te claghene over Reynaerde,
King Nobel holds court
den fellen metten grijsen baerde.
Nu gaet hier up eene claghe.
Isingrijn ende sine maghe
ghinghen voer den coninc staen. [193ra]
Ysengrijn begonste saen
ende sprac: ‘Coninc heere,
dor hu edelheit ende dor hu eere
ende dor recht ende dor ghenade,
ontfaerme hu miere scade
die mi Reynaert heeft ghedaen,
daer ic af dicken hebbe ontfaen
groeten lachter ende verlies.
Voer al dandre ontfaerme hu dies
dat hi mijn wijfhevet verhoert
dat hise beseekede daer si laghen,
datter twee noint ne saghen
ende si worden staerblent.
Nochtan hoendi mi sent.
datter eenen dach af was ghenomen
ende Reynaerd soude hebben ghedaen
sine onsculde. Ende also saen
alse die heleghe waren brocht,
ende ontfoer ons in sine veste.
Heere, dit kennen noch die beste
die te hove zijn commen hier.
Mi hevet Reynaert, dat felle dier,
charges against Reynaert. Medieval feudal society initially lacked a strong, central source
is dissembling (cf. p. 33).
so vele te leede ghedaen,
ic weet wel al sonder waen:
al ware al tlaken paerkement
dat men maket nu te Ghent,
inne ghescreeft niet daeran.
Dies zwijghics nochtan,
neware mijns wives lachter
ne mach niet bliven achter,
Doe Ysengrijn dit hadde ghesproken,
stont up een hondekijn, hiet Cortoys,
ende claghede den coninc in Francsoys
dat alles goets en hadde meere
dan alleene eene worst
ende hem Reynaert, die felle man, [193rb]
die selve worst stal ende nam.
Tybeert die cater die wart gram.
Aldus hi sine tale began
ende spranc midden in den rinc
ende seide: ‘Heere coninc,
dordat ghi Reynaerde zijt onhout,
hi ne hebbe te wroughene jeghen hu.
Dat Cortoys claghet nu,
dats over menich jaer ghesciet.
Ic hadse bi miere lust ghewonnen
daer ic bi nachte quam gheronnen
omme bejach in eene molen,
daer ic die worst in hadde ghestolen
eenen slapenden molenman.
happened many a year ago.
dan was bi niemene dan bi mi.
Hets recht dat omberecht zi
die claghe die Cortoys doet.’
Pancer de bever sprac: ‘Dinct hu goet,
Tybeert, dat men die claghe ombeere?
Reynaert es een recht mordeneere
ende een trekere ende een dief.
Hi ne heeft oec niemene so lief,
no den coninc, minen heere,
hi ne wilde dat hi lijf ende eere
een vet morzeel van eere hinnen.
Wat sechdi van eere laghe?
En dedi ghistren in den daghe
eene die meeste overdaet
an Cuwaerde den hase, die hier staet,
die noyt eenich dier ghedede?
Want hi hem binnen sconinX vrede
ende binnen des coninX gheleede
ghelovede te leerne sinen crede
ende soudene maken capelaen.
Doe dedine sitten gaen
vaste tusschen sine beene.
Doe begonsten si overeene
spellen ende lesen beede [193va]
ende lude te zinghene crede.
Mi gheviel dat ic te dien tijden
ter selver stede soude lijden.
Doe hoerdic haerre beeder sanc
ende maecte daerwaert minen ganc
met eere arde snelre vaerde.
Doe vandic daer meester Reynaerde,
die ziere lessen hadde begheven
die hi tevoren up hadde gheheven,
ende diende van sinen houden spelen
ende hadde Coewaerde bi der kelen
ende soude hem thoeft afhebben ghenomen
waer ic hem niet te hulpen comen
bi avontueren in dien stonden.
Siet hier noch die verssche wonden
ende die teekine, heere coninc,
die Coewaert van hem ontfinc.
Laetti dit bliven onghewroken,
dat hu verde dus es tebroken,
ghi ne wreket als huwe mannen wijsen,
men saelt huwen kindren mesprijsen
hiernaer over wel menich jaer.’
‘Bi Gode, Pancer, ghi secht waer,’
sprac Ysengrijn daer hi stoet.
‘Heere, waer Reynaerd doot, het waer ons goet,
also behoude mi God mijn leven.
Neware wert hem dit vergheven,
hi sal noch hoenen binnen eere maent
sulken dies niet ne bewaent.’
Doe spranc up Grinbert die das,
met eere verbolghenlike tale:
‘Heere Ysengrijn, men weet dat wale
ende hets een hout bijspel:
viants mont seit selden wel.
Verstaet, neemt miere talen goem:
ic wilde, hi hinghe an eenen boem
bi ziere kelen als een dief
die andren heeft ghedaen meest grief.
‘Lord Ysingrijn, as everyone surely knows
Heere Ysengrijn, wildi angaen
soendinc ende dat ontfaen,
daertoe willic helpen gherne. [193vb]
Mijn oem en saelt hem oec niet wernen.
Entie meest andren heeft mesdaen
sal den andren in baten staen
van minen oem ende van hu.
Al comt hi niet claghen nu,
ware mijn oem wel te hove
ende stonde in sconinx love,
heere Ysengrijn, als ghi doet,
en soude den coninc niet dincken goet
ende ghi ne bleves heden onbegrepen,
dat ghi sijn vel so hebt ghenepen
so dicwile met huwen scerpen tanden,
dat hi niet ne conde ghehanden.’
Ysengrijn sprac: ‘Hebdi gheleert
an huwen oem dus lieghen apeert?’
‘In hebbe daeran niet gheloghen.
Ghi hebt minen oem bedroghen
arde dicke in menegher wijsen.
Ghi mesleettene van den pladijse
die hi hu warp van der kerren,
doe ghi hem volghet van verren
ende ghi die beste pladijse uplaset,
daer ghi hu ane hadt versadet.
sonder alleene eenen pladijsengraet
dat ghi hem te jeghen brocht,
dordat ghine niet en mocht.
Sint hoendine van eenen bake
die vet was ende van goeder smake,
dien ghi leit in huwen muzeele.
cart, leaving nothing but the bones ofone single fish (cf. p. 31—32).
Doe Reynaert heesschede zijn deele,
“Hu deel willic hu gheven gherne,
Reynaert, scone jonghelinc!
Die wisse daer die bake an hinc,
Reynaerde waes lettel te bet
dat hi den goeden bake ghewan
in sulker zorghen, dattene een man
vinc ende warpene in sinen zac.
Dese pine ende dit onghemac
hevet hi leden dor Ysengrijne [194ra]
ende ondert waerven meer dan ic hu rijme.
Ghi heeren, dinct hu dit ghenouch?
Nochtan om meer onghevouch
dat hi claghet om sijn wijf,
die Reynaerde hevet al haer lijf
ghemint; so doet hi hare.
Al ne makeden zijt niet mare,
ic dart wel segghen over waer
dat langher es dan VII jaer
dat Reynaert hevet hare trauwe.
Omdat Haersint, die scone vrouwe,
dor minne ende dor quade zede
Reynaert sinen wille dede,
Wat talen mach daeromme wesen?
Nu maket heere Cuwaert, die hase,
eene claghe van eere blase.
Of hi den credo niet wel en las,
Reynaerd, die zijn meester was,
mochte hi sinen clerc niet blauwen?
Dat ware onrecht, entrauwen.
1988.
Reynaert, my dear young man!
accommodated Reynaert
Now Lord Cuwaert, the hare,
Cortoys claghet om eene worst
die hi verloes in eene vorst.
Die claghe ware bet verholen:
Male quesite male perdite:
over rechtwert men qualike quite
dat men hevet qualic ghewonnen.
Wie sal Reynaerde dat verjonnen
Niemen die recht versceeden can.
Reynaert es een gherecht man.
Sint dat die coninc sinen ban
hevet gheboden ende sinen vrede,
so weetic wel dat hi ne dede
dinc negheene dan of hi ware
hermite ofte clusenare.
Naest siere huut draecht hi een hare.
Binnen desen naesten jare
Dat seidi die ghistren danen quam.
Malcroys hevet hi begheven, [194rb]
sinen casteel, ende hevet upheven
eene cluse daer hi leghet in.
Ander bejach no ander ghewin
so wanic wel dat hi ne hevet
dan karitate die men hem ghevet.
Bleec es hi ende magher van pinen.
Hongher, dorst, scerpe karijnen
doghet hi voer sine zonden.’
Recht te desen selven stonden,
doe Grimbert stont in dese tale,
saghen si van berghe te dale
Canticler commen ghevaren,
ende brochte up eene bare
eene doode hinne ende hiet Coppe,
goods’, or ‘stolen goods never thrive’.

But wait! That is not correct. Anyone who knows the Reynaert will spot that something is off already in the first few lines. The Reynaert (in the Comburg manuscript) reads:

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.

But our parser gives us:

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
dat die avonture van Reynaerde
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.

It missed the lines "hem vernoyde so haerde" and "in Dietsche onghemaket bleven". It also apparently held on to some English: a long part of a footnote is lodged within the text: "cart, leaving nothing but the bones ofone single fish (cf. p. 31—32)." What happened?

On closer inspection it turns out that "hem vernoyde so haerde" contains a Middle Dutch word that is also an English stop word ('so'). Because the verse is so short, that single word is enough: one stop word among four tokens gives a relative English score of 0.25, which exceeds the threshold of 0.2. The same presumably happens to "in Dietsche onghemaket bleven" through 'in'. Conversely, the English footnote text has an OCR misreading which 'hides' two English stop words ("ofone" for "of one"), which keeps it under the threshold.
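
The arithmetic can be checked with a small sketch. It reuses the scoring logic of the English class in a standalone function. STOPWORDS is a hypothetical excerpt, since the real stopwords_en.txt is not reproduced here; I assume that it contains at least 'so', 'in', 'but', 'the', 'of' and 'one'. The threshold is the 0.2 set in the class.

```ruby
# Assumption: a small excerpt of an English stop word list.
STOPWORDS = %w[ the of one but so in ]
THRESHOLD = 0.2

def strip_punct( token )
  token.gsub( /[\.:;'\u201C\u2018\u2019\u201D?!\(\),]+$|^[\.:;'\u201C\u2018\u2019\u201D?!\(\),]+/, '' )
end

def english_score( line )
  tokens = line.split( /\s+/ )
  hits = tokens.count { |token| STOPWORDS.include?( strip_punct( token ).downcase ) }
  hits.to_f / tokens.size
end

[
  "hem vernoyde so haerde",                # 1 of 4 tokens
  "in Dietsche onghemaket bleven",         # 1 of 4 tokens
  "in Dietsche dus hevet begonnen.",       # 1 of 5 tokens
  "cart, leaving nothing but the bones ofone single fish (cf. p. 31\u201432).",   # 2 of 12
  "cart, leaving nothing but the bones of one single fish (cf. p. 31\u201432)."   # 4 of 13
].each do |line|
  score = english_score( line )
  verdict = score > THRESHOLD ? "English" : "kept"
  puts format( "%.3f  %-7s  %s", score, verdict, line )
end
```

Under these assumptions the two missing verses score 0.25 and are discarded as English, whereas "in Dietsche dus hevet begonnen." lands exactly on 0.2 and survives only because the comparison is strictly greater than. The footnote line scores about 0.167 with "ofone" and about 0.308 once the two words are separated.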

Clearly our English parsing model is not up to par yet. We will have to iterate the code through a new development cycle to improve the performance.

